Add Vector Search semantic product discovery example #153
Demonstrates a Direct Access Vector Search index and endpoint declared
as bundle resources (vector_search_endpoints, vector_search_indexes),
tested e2e against staging with the direct engine.
Key design decisions:
- Jobs use resource references (${resources.*.name}) for endpoint and
index names so dev-mode prefixing flows through automatically
- schema_json uses flat {"col":"type"} format required by the API
- Notebooks embed descriptions/queries explicitly (Direct Access indexes
don't auto-embed; that's a Delta Sync feature)
- engine: direct set in bundle config so no env var is needed
Co-authored-by: Isaac
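To illustrate the flat schema_json point, here is a minimal sketch of building the string from a column-to-type mapping. The column names other than description_vector and the type strings are assumptions for illustration, not taken from this PR, and is_flat_schema is a hypothetical helper.

```python
import json


def is_flat_schema(schema):
    """Return True when every value is a plain type string, i.e. {"col": "type"}."""
    return isinstance(schema, dict) and all(
        isinstance(name, str) and isinstance(col_type, str)
        for name, col_type in schema.items()
    )


# Assumed columns and type strings, for illustration only.
columns = {
    "product_id": "string",
    "description": "string",
    "description_vector": "array<float>",
}

# The flat form: one JSON object mapping column name to type.
schema_json = json.dumps(columns)

# A nested, field-list style schema is not the flat form described above.
nested = {"fields": [{"name": "description", "type": "string"}]}

print(schema_json)
print(is_flat_schema(columns), is_flat_schema(nested))
```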
pietern left a comment
Ran this example end-to-end on a dogfood workspace with the released CLI (v1.1.0): validate → deploy → run (setup + query) → destroy. The embed → upsert → similarity-search logic is correct — all three README example queries returned the documented top result, so the substance is solid. Also confirmed v1.1.0 recognizes vector_search_endpoints / vector_search_indexes, so the cli#5123 dependency has shipped (correctly struck through in the description).
Nice to see the index name reference ${resources.schemas.product_search_schema.name} rather than the raw ${var.schema} — that's the mode-prefix-safe form.
Remaining feedback is about per-deploy isolation and the CLI run experience, flagged inline. Nothing blocks the single-user happy path; it's mostly "what happens when a second person deploys this into the same workspace."
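For readers unfamiliar with the embed → upsert → similarity-search flow, the following is a toy, in-memory sketch of the same three steps. It is not the notebooks' code: toy_embed and ToyIndex are hypothetical stand-ins, the embedding is a crude word-bucket hash (lexical, not semantic), and the product rows are made up. It only shows that with a Direct Access style index the caller supplies vectors both when writing and when querying.

```python
import math


def toy_embed(text, dim=8):
    """Hypothetical stand-in for an embedding model: bucket words into a unit vector."""
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[sum(ord(c) for c in word) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class ToyIndex:
    """In-memory stand-in for an index that stores caller-supplied vectors."""

    def __init__(self, dim):
        self.dim = dim
        self.rows = {}

    def upsert(self, rows):
        for row in rows:
            if len(row["description_vector"]) != self.dim:
                raise ValueError("vector length does not match index dimension")
            self.rows[row["id"]] = row  # same id overwrites: insert-or-update

    def similarity_search(self, query_vector, num_results=3):
        # Vectors are unit length, so the dot product is the cosine similarity.
        scored = [
            (sum(a * b for a, b in zip(query_vector, row["description_vector"])), row["id"])
            for row in self.rows.values()
        ]
        scored.sort(reverse=True)
        return [(row_id, score) for score, row_id in scored[:num_results]]


# Made-up catalog rows; the caller embeds each description before upserting.
products = [
    {"id": 1, "description": "insulated stainless water bottle"},
    {"id": 2, "description": "trail running shoes with grip"},
]
index = ToyIndex(dim=8)
index.upsert([{**p, "description_vector": toy_embed(p["description"])} for p in products])

# The query is embedded the same way before searching.
hits = index.similarity_search(toy_embed("trail running shoes"), num_results=1)
print(hits)
```

In the real example the embedding step would call an embedding model and the index would be the deployed Vector Search index; the division of labor between caller and index is the part this sketch is meant to show.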
| resources: | ||
| vector_search_endpoints: | ||
| product_search_endpoint: | ||
| name: ${var.endpoint_name} |
The endpoint name is a hardcoded, workspace-global value with no per-deploy uniqueness. Unlike jobs/schemas, vector_search_endpoints names aren't rewritten by any deployment mode, so every deploy of this example tries to create product-search-endpoint. A second user — or a second copy — gets 409 ALREADY_EXISTS; I hit exactly this against an existing endpoint while testing. The README already notes it "must be unique per workspace," but nothing in the bundle makes it so.
Options: treat the endpoint as a shared prerequisite/variable (endpoints are slow to provision and are designed to host many indexes), or bake something unique into the default (e.g. ${workspace.current_user.short_name}). This compounds with the single prod target (see the comment on databricks.yml).
| prod: | ||
| mode: production | ||
| default: true |
The bundle ships only a prod / mode: production target, set as the default. For a copy-me example that's worth reconsidering: mode: production applies no name prefixing, so a plain databricks bundle deploy creates unprefixed, shared-namespace resources — the product-search-endpoint endpoint, the main.product_search schema, and main.product_search.product_index. Two people trying the quickstart collide on all three, and it writes into a generic schema in main.
Most examples default to a dev target with mode: development, so the first thing a user runs produces an isolated, prefixed copy. Consider adding a dev default target (and/or namespacing the schema and endpoint), keeping prod for the production story.
| } | ||
| embedding_vector_columns: | ||
| - name: description_vector | ||
| embedding_dimension: 1024 |
embedding_dimension has two sources of truth. databricks.yml defines an embedding_dimension variable (default "1024") and threads it through both notebooks, but this line hardcodes 1024 independently — and the value is immutable after index creation. An override like --var embedding_dimension=512 would silently produce vectors this index rejects. Consider referencing the variable so there's one knob:
| embedding_dimension: 1024 | |
| embedding_dimension: ${var.embedding_dimension} |
The variable defaults to the string "1024" (fine as a job param) while this field is an integer, so confirm the interpolation coerces — or declare the variable's type as integer.
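Independent of how the bundle interpolates the value, the notebooks could guard against a mismatch themselves. A minimal sketch, assuming the dimension arrives as a string job parameter and that rows carry a description_vector list; parse_embedding_dimension and check_vectors are hypothetical helper names, not functions from this PR.

```python
def parse_embedding_dimension(raw):
    """Coerce a string job parameter such as "1024" to a positive int."""
    dim = int(raw)  # raises ValueError on non-numeric input
    if dim <= 0:
        raise ValueError(f"embedding_dimension must be positive, got {dim}")
    return dim


def check_vectors(rows, expected_dim, column="description_vector"):
    """Fail fast if any vector's length differs from the index dimension."""
    for i, row in enumerate(rows):
        actual = len(row[column])
        if actual != expected_dim:
            raise ValueError(
                f"row {i}: {column} has {actual} values, index expects {expected_dim}"
            )


expected = parse_embedding_dimension("1024")
check_vectors([{"description_vector": [0.0] * 1024}], expected)  # passes silently

try:
    # Simulates an override to 512 while the index was created with 1024.
    check_vectors([{"description_vector": [0.0] * 512}], expected)
except ValueError as err:
    print(err)
```

A check like this turns a silent mismatch into an explicit error before the upsert call, which is useful because the index dimension cannot be changed after creation.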
| rows = results["result"]["data_array"] | ||
| df = pd.DataFrame(rows, columns=result_columns) | ||
| df.index += 1 | ||
| print(df.to_string()) |
Query results aren't visible from databricks bundle run — the path the README leads with. print(df.to_string()) only reaches the notebook cell output; bundle run product_discovery_query shows just RUNNING / TERMINATED SUCCESS, and jobs get-run-output comes back empty because the notebook never calls dbutils.notebook.exit(). For a demo whose payoff is the ranked list, consider also exiting with the result:
dbutils.notebook.exit(df.to_json(orient="records"))
(Keep the print for interactive use.) Otherwise the README should note that you open the run URL to see the output.
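A slightly fuller sketch of the same idea, written without pandas so it runs anywhere. The response shape (results["result"]["data_array"]) is taken from the notebook snippet above; the sample rows, column names, and the results_to_records helper are hypothetical. The dbutils call only exists inside a Databricks notebook, so it appears here as a comment.

```python
import json


def results_to_records(results, columns):
    """Turn a similarity-search response into a ranked list of dicts."""
    rows = results["result"]["data_array"]
    return [
        {"rank": rank, **dict(zip(columns, row))}
        for rank, row in enumerate(rows, start=1)
    ]


# Hypothetical response in the shape the query notebook indexes into.
sample_results = {
    "result": {
        "data_array": [
            ["P-001", "Example hiking boot", 0.91],
            ["P-002", "Example insulated bottle", 0.42],
        ]
    }
}
result_columns = ["product_id", "name", "score"]

records = results_to_records(sample_results, result_columns)
payload = json.dumps(records)
print(payload)  # still visible in the cell output for interactive use
# In the notebook, the last cell would then call: dbutils.notebook.exit(payload)
```

Returning a JSON string rather than the formatted table keeps the exit value machine-readable, so a caller of jobs get-run-output could parse it.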
There was a problem hiding this comment.
Don't use real brands - make these fake - otherwise it can look like we officially endorse them.
| @@ -0,0 +1,157 @@ | |||
| # Vector Search: Semantic Product Discovery | |||
|
|
|||
| A Declarative Automation Bundle demonstrating **semantic product search** using | |||
| A Declarative Automation Bundle demonstrating **semantic product search** using | |
| A Declarative Automation Bundle demonstrating semantic product search using |
Nit. There's no reason to bold this.
| # Vector Search: Semantic Product Discovery | ||
|
|
||
| A Declarative Automation Bundle demonstrating **semantic product search** using | ||
| [Databricks Vector Search](https://docs.databricks.com/en/generative-ai/vector-search.html). |
I have heard "Vector Search" is being renamed?
| Keyword search fails when shoppers use different words than what appears in product | ||
| descriptions. A customer searching for *"something to keep my coffee hot all day"* won't | ||
| match a product described as an *"insulated stainless water bottle with double-wall vacuum | ||
| insulation"* — even though it's the right answer. |
| insulation"* — even though it's the right answer. | |
| insulation" even though it's the right answer. |
| ## The problem | ||
|
|
||
| Keyword search fails when shoppers use different words than what appears in product | ||
| descriptions. A customer searching for *"something to keep my coffee hot all day"* won't |
| descriptions. A customer searching for *"something to keep my coffee hot all day"* won't | |
| descriptions. A customer searching for "something to keep my coffee hot all day" won't |
|
|
||
| Keyword search fails when shoppers use different words than what appears in product | ||
| descriptions. A customer searching for *"something to keep my coffee hot all day"* won't | ||
| match a product described as an *"insulated stainless water bottle with double-wall vacuum |
| match a product described as an *"insulated stainless water bottle with double-wall vacuum | |
| match a product described as an "insulated stainless water bottle with double-wall vacuum |
| match a product described as an *"insulated stainless water bottle with double-wall vacuum | ||
| insulation"* — even though it's the right answer. | ||
|
|
||
| Semantic search using vector embeddings matches on **meaning**, not words. |
| Semantic search using vector embeddings matches on **meaning**, not words. | |
| Semantic search using vector embeddings matches on meaning, not words. |
| ## Prerequisites | ||
|
|
||
| - Databricks workspace with Unity Catalog enabled | ||
| - Databricks CLI that supports `vector_search_endpoints` / `vector_search_indexes` as bundle resources |
I would just put the version unless there is a reason not to?
| - Databricks CLI that supports `vector_search_endpoints` / `vector_search_indexes` as bundle resources | |
| - Databricks CLI version 1.1.0 or above |
| - Databricks CLI that supports `vector_search_endpoints` / `vector_search_indexes` as bundle resources | ||
| - An existing Unity Catalog catalog (default: `main`) | ||
|
|
||
| ## Quick start |
I'd probably call this Usage?
| databricks auth login --host https://your-workspace.cloud.databricks.com | ||
| ``` | ||
|
|
||
| 2. **Configure** `databricks.yml` — set the workspace host and any variable overrides |
| 2. **Configure** `databricks.yml` — set the workspace host and any variable overrides | |
| 2. Configure `databricks.yml`. Set the workspace host and any variable overrides. |
I'm not a fan of the bolded style of these steps, but minor nit.
|
|
||
| ## Quick start | ||
|
|
||
| 1. **Authenticate** |
| 1. **Authenticate** | |
| 1. Authenticate the CLI: |
|
|
||
| 2. **Configure** `databricks.yml` — set the workspace host and any variable overrides | ||
|
|
||
| 3. **Deploy** — creates the schema, endpoint, index, jobs, and syncs `data/products.json` |
| 3. **Deploy** — creates the schema, endpoint, index, jobs, and syncs `data/products.json` | |
| 3. Deploy the bundle. This creates the schema, endpoint, index, jobs, and syncs `data/products.json`. |
| ``` | ||
| > Vector Search endpoint creation takes a few minutes to reach ONLINE status. | ||
|
|
||
| 4. **Load the catalog** — embeds all product descriptions and upserts them into the index |
| 4. **Load the catalog** — embeds all product descriptions and upserts them into the index | |
| 4. Load the catalog by running the bundle. This embeds all product descriptions and upserts them into the index. |
| databricks bundle run product_discovery_setup | ||
| ``` | ||
|
|
||
| 5. **Search** — pass any natural-language query |
| 5. **Search** — pass any natural-language query | |
| 5. Pass any natural-language query to search. |
| databricks bundle run product_discovery_query --params "query=footwear for slippery wet trails" | ||
| ``` | ||
|
|
||
| 6. **Or open** `src/02_query_demo.py` in your workspace to run queries interactively |
| 6. **Or open** `src/02_query_demo.py` in your workspace to run queries interactively | |
| 6. Or open `src/02_query_demo.py` in your workspace to run queries interactively: |
| ## Bundle resources | ||
|
|
||
| | Resource | Type | Description | | ||
| |---|---|---| | ||
| | `product_search_schema` | `schemas` | Unity Catalog schema that namespaces the index | | ||
| | `product_search_endpoint` | `vector_search_endpoints` | Managed ANN serving endpoint | | ||
| | `product_index` | `vector_search_indexes` | Direct Access index — schema defined in `resources/index.yml` | | ||
| | `product_discovery_setup` | `jobs` | Embeds product descriptions and upserts into the index | | ||
| | `product_discovery_query` | `jobs` | Embeds a query and returns ranked results | |
This section feels misplaced - this is part of the project structure below. And in fact, maybe delete this or merge the two?
There was a problem hiding this comment.
Or actually maybe the opposite - put the project structure here because that is a nice overview and then merge the descriptions with that?
| `direct_access_index_spec` with `index_type: DELTA_SYNC` and `delta_sync_index_spec` in | ||
| `resources/index.yml`, and remove the upsert job. | ||
|
|
||
| ## Project structure |
Move this up higher? It sure seems like you want this info before all the descriptions of the resources/files.
| A Declarative Automation Bundle demonstrating **semantic product search** using | ||
| [Databricks Vector Search](https://docs.databricks.com/en/generative-ai/vector-search.html). | ||
|
|
||
| ## The problem |
I would probably put this as the first paragraph of "How it works" instead of a separate "The problem" section.
| @@ -0,0 +1,157 @@ | |||
| # Vector Search: Semantic Product Discovery | |||
|
|
|||
| A Declarative Automation Bundle demonstrating **semantic product search** using | |||
I think I get it after reading this whole page and usage, etc, but I think it might have helped if this first sentence expanded a bit on what using a bundle helps me do here (even if it's obvious that it is automating setup - say that).
There was a problem hiding this comment.
Just calling it out: users will take the files names here and copy directly when they use this example to create their own templates with this resource. So is "endpoint.yml" a good (best practice) file name to contain vector search endpoint definitions? (And same for others.)
Summary
Adds a Declarative Automation Bundle under contrib/vector_search_product_discovery/ that demonstrates semantic product search end-to-end with Databricks Vector Search:
- vector_search_endpoints + vector_search_indexes declared as bundle resources, with jobs referencing them via ${resources.*.name} so dev-mode prefixing flows through automatically
- engine: direct is set in databricks.yml; descriptions are embedded explicitly in 01_upsert_products.py and the query notebook embeds the query before calling similarity_search, because Direct Access indexes don't auto-embed (that's a Delta Sync feature)
- schema_json uses the flat {"col":"type"} form required by the API
Dependency
Requires databricks/cli#5123 (still open), which lands vector_search_indexes as a first-class DABs resource on the direct engine. Until that PR merges and ships in a CLI release, databricks bundle deploy against this example will fail to recognize the vector_search_indexes resource type.
Test plan
- databricks bundle validate against a CLI built from "Add vector_search_indexes resource (direct engine)" cli#5123
- databricks bundle plan
- databricks bundle deploy: endpoint reaches ONLINE, index created
- databricks bundle run product_discovery_setup: products embedded and upserted
- databricks bundle run product_discovery_query --params "query=footwear for slippery wet trails": returns ranked results
- databricks bundle destroy: clean teardown
This pull request and its description were written by Isaac.